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Abstract. Data sets are often modeled as samples from a probability distri- 
bution in R^, for D large. It is often assumed that the data has some inter- 
esting low-dimensional structure, for example that of a d-dimensional mani- 
fold A^, with d much smaller than D. When JV[ is simply a linear subspace, 
one may exploit this assumption for encoding efficiently the data by project- 
ing onto a dictionary of d vectors in (for example found by SVD), at a 
cost {n -\- D)d for n data points. When M. is nonlinear, there are no "ex- 
plicit" and algorithmically efficient constructions of dictionaries that achieve 
a similar efficiency: typically one uses either random dictionaries, or dictio- 
naries obtained by black-box global optimization. In this paper we construct 
data-dependent multi-scale dictionaries that aim at efficiently encoding and 
manipulating the data. Their construction is fast, and so are the algorithms 
that map data points to dictionary coefficients and vice versa, in contrast with 
L-'^-type sparsity-seeking algorithms, but alike adaptive nonlinear approxima- 
tion in classical multiscale analysis. In addition, data points are guaranteed 
to have a compressible representation in terms of the dictionary, depending on 
the assumptions on the geometry of the underlying probability distribution. 



1. Introduction 

We construct Geometric Multi-Resolution Analyses for analyzing intrinsically 
low- dimensional point clouds in high-dimensional spaces, modeled as samples from 
a probability distribution supported on d-dimensional set M (in particular, a man- 
ifold) embedded in R^, in the regime d <^ D. This setting has been recognized 
as important in various applications, ranging from the analysis of sounds, images 
(RGB or hyperspectral, P), to gene arrays, EEG signals [2 , and other types of 
manifold- valued data [3], and has been at the center of much investigation in the 
applied mathematics [H [5l and machine learning communities during the past 
several years. This has lead to a flurry of research on several problems, old and new, 
such as estimating the intrinsic dimensionality of point clouds [Zl [3 HI [TOl HT] [12] , 
parametrizing sampled manifolds [H |13l [HI |T5l [161 113 [El IIHI [20] , constructing dic- 
tionaries tuned to the data 2T1[22] or for functions on the data [2J, 24, 25,^6], and 
their applications to machine learning and function approximation [27, 22\ EQ] . 

We focus on obtaining multi-scale representations in order to organize the data in 
a natural fashion, and obtain efficient data structures for data storage, transmission, 
manipulation, at different levels of precision that may be requested or needed for 
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particular tasks. This work ties with a significant amount of recent work in different 
directions: (a) Harmonic analysis and efficient representations of signals; (b) Data- 
adaptive signal representations in high dimensional spaces and dictionary learning; 
(c) Hierarchical structures for organization of data sets; (d) Geometric analysis of 
low-dimensional sets in high-dimensional spaces. 

Harmonic analysis and efficient representations of signals. Represen- 
tations of classes of signals and data have been an important branch of research 
in multiple disciplines. In harmonic analysis, a linear infinite-dimensional function 
space T typically models the class of signals of interest, and linear representations 
in the form / = OLi(\>i^ for f G T in terms of a dictionary of atoms ^ := C 
are studied. Such dictionaries may be bases or frames, and are constructed so that 
the sequence of coefficients {ai}i has desirable properties, such as some form of 
sparsity, or a distribution highly concentrated at zero. Requiring sparsity of the 
representation is very natural from the viewpoints of statistics, signal processing, 
and interpretation of the representation. This, in part, motivated the construction 
of Fourier-like bases, wavelets, wedgelets, ridgelets, curvelets etc... [3T1[32^ 33 , just 
to name a few. Several such dictionaries are proven to provide optimal representa- 
tions (in a suitably defined sense) for certain classes of function spaces (e.g. some 
simple models for images) and/or for operators on such spaces. While orthogonal 
dictionaries were originally preferred (e.g. [34]), a trend developed towards over- 
complete dictionaries (e.g. frames [34l ES] and references therein) and libraries of 
dictionaries (e.g. wavelet and cosine packets [31 , multiple dictionaries [36j, fusion 
frames [37J), for which the set of coefficients {ai)i needed to represent a signal / 
is typically non-unique. Fast transforms, crucial in applications, have often been 
considered a fundamental hallmark of several of the transforms above, and was 
usually achieved through a multi-scale organization of the dictionaries. 

Data- adaptive signal representation and dictionary learning. A more 
recent trend [33, 38, 21, 39, 40, 22 , motivated by the desire to model classes of 
signals that are not well- modeled by the linear structure of function spaces, has 
been that of constructing data- adapted dictionaries: an algorithm is allowed to see 
samples from a class of signals T (not necessarily a linear function space), and 
constructs a dictionary ^ := that optimizes some functional, such as the 

sparsity of the coefficients for signals in J^. The problem becomes being able to 
construct the dictionary typically highly over-complete, so that, given / G J^, a 
rapid computation of the "best" (e.g. sparsest) coefficients {ai)i so that / = ai<pi 
is possible, and {ai)i is sparse. The problem of constructing ^ with the properties 
above, given a sample {fn}n ^ is often called dictionary learning^ and has been 
at the forefront of much recent research in harmonic analysis, approximation theory, 
imaging, vision, and machine learning: see [38l EU [Ml HqI [22 and references therein 
for constructions and applications. 

There are several parameters in this problem: given training data from J^, one 
seeks ^ with / elements, such that every element in the training set may be repre- 
sented, up to a certain precision e, by at most m elements of the dictionary. The 
smaller / and m are, for a given e, the better the dictionary. 

Several current approaches may be summarized as follows [41] : consider a finite 
training set of signals Xn = {xi}^^^ C M^, which we may represent by a R^^^ 
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matrix, and optimize the cost function 



(1.1) 



1 

/„($) = -^^(xi,*) 



where <l> G 



is the dictionary, and i a loss function, for example 
£{x^ ^) := min — ^q^||^d + A||a||i 



(1.2) 



where A is a regularization parameter. This is basis pursuit [33] or lasso [42]. One 
typically adds constraints on the size of the columns of <l>, for example < 1 

for all i, which we can write as <l> G C for some convex set C. The overall problem 
may then be written as a matrix factorization problem with a sparsity penalty: 



where ||q^||i,i := J^ii i2 I^hmI- While for a fixed ^ the problem of minimizing over 
a is convex, and for fixed a the problem of minimizing over <l>'s is also convex, 
the joint minimization problem is non-convex, and alternate minimization methods 
are often employed. Overall, this requires minimizing a non-convex function over a 
very high-dimensional space. We refer the reader to [41j and references therein for 
techniques for attacking this optimization problem. 

Constructions of such dictionaries (e.g. K-SVD [21 , /c-fiats [22j, optimization- 
based methods [41j, Bayesian methods [39 ) generally involve optimization or heuris- 
tic algorithms which are computationally intensive, do not shed light on the rela- 
tionships between the dictionary size /, the sparsity of a, and the precision e, and 
the resulting dictionary <l> is typically unstructured, and finding computationally, 
or analyzing mathematically, the sparse set of coefficients a may be challenging. 

In this paper we construct data-dependent dictionaries based on a Geometric 
Multi-Resolution Analysis of the data. This approach is motivated by the intrinsi- 
cally low-dimensional structure of many data sets, and is inspired by multi-scale geo- 
metric analysis techniques in geometric measure theory such as those in [43l [44j , as 
well as by techniques in multi-scale approximation for functions in high-dimension 
[iniHB]. These dictionaries are structured in a multi-scale fashion (a structure that 
we call Geometric Multi-Resolution Analysis) and can be computed efficiently; the 
expansion of a data point on the dictionary elements is guaranteed to have a certain 
degree of sparsity m, and may be computed by a fast algorithm; the growth of the 
number of dictionary elements / as a function of e is controlled depending on geo- 
metric properties of the data. We call the elements of these dictionaries geometric 
wavelets^ since in some respects they generalize wavelets from vectors that analyze 
functions in linear spaces to affine vectors that analyze point clouds with possibly 
nonlinear geometry. The multi-scale analysis associated with geometric wavelets 
shares some similarities with that of standard wavelets (e.g. fast transforms, a ver- 
sion of two-scale relations, etc.), but is in fact quite different in many crucial 
respects. It is nonlinear, as it adapts to arbitrary nonlinear manifolds modeling the 
data space J^, albeit every scale-to-scale step is linear; translations or dilations do 
not play any role here, while they are often considered crucial in classical wavelet 
constructions. Geometric wavelets may allow the design of new algorithms for ma- 
nipulating point clouds similar to those used for wavelets to manipulate functions. 



(1.3) 
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The rest of the paper is organized as fohows. In Sec. [2] we describe how to 
construct the geometric wavelets in a multi-scale fashion. We then present our 
algorithms in Sec. [3] and illustrate them on a few data sets, both synthetic and 
real- world, in Sec. IH Sec. [5] introduces an orthogonal verison of the construction; 
more variations or optimizations of the construction are postponed to Sec. [H The 
next two sections discuss how to represent and compress data efficiently (Sec. [7j) 
and computational costs (Sec. [8]). A naive attempt at modeling distributions is 
performed in Sec. [9l Finally, the paper is concluded in Sec. [10] by pointing out 
some future directions. 

2. Construction of Geometric Multi-Resolution Analyses 

Let (A^, p, /i) be a metric measure space with fi a Borel probability measure and 
M C R^. In this paper we restrict our attention, in the theoretical sections, to 
the case when (A^, p, /i) is a smooth compact Riemannian manifold of dimension d 
isometrically embedded in R^, endowed with the natural volume measure; in the 
numerical examples, {Ai^p^ja) will be a finite discrete metric space with counting 
measure, not necessarily obtained by sampling a manifold as above. We will be 
interested in the case when the "dimension" d of A4 is much smaller than the 
dimension of the ambient space R^. While d is typically unknown in practice, 
efficient (multi-scale, geometric) algorithms for its estimation are available (see 
0, which also contains many references to previous work on this problem), under 
additional assumptions on the geometry of A4. 

Our construction of a Geometric Multi-Resolution Analyses (GMRA) consists of 
three steps: 

1. A multi-scale geometric tree decomposition of M into subsets {Cj^kjkeJCjjez- 

2. A (i-dimensional affine approximation in each dyadic cell Cj^k^ yielding a 
sequence of approximating piecewise linear sets {A^j}, one for each scale j. 

3. A construction of low-dimensional affine difference operators that efficiently 
encode the differences between Mj and A^j+i- 

This construction parallels, in a geometric setting, that of classical multi-scale 
wavelet analysis [3T, |47l HHl |49l |W: the nonlinear space M replaces the classi- 
cal function spaces, the piecewise affine approximation at each scale substitutes the 
linear projection on scaling function spaces, and the difference operators play the 
role of the classical linear wavelet projections. We show that when is a smooth 
manifold, guarantees on the approximation rates of M by the Mj may be derived 
(see Theorem l2.3l in Sec. l2.4p . implying compressibility of the GMRA representation 
of the data. 

We construct bases for the various affine operators involved, producing a hier- 
archically organized dictionary that is adapted to the data, which we expect to be 
useful in the applications discussed in the introduction. 

2.1. Tree decomposition. Let B^{x) be the p-ball inside M. of radius r > 
centered at x G Al . We start by a spatial multi-scale decomposition of the data set 
M. 

Definition 2.1. A tree decomposition of a d-dimensional metric measure space 
(M^p^fi) is a family of open sets in M, {Cj^k}ke}Cjjez, called dyadic cells^ such 
that 

(i) for every j e Z, p{M \ UkeKjCj^k) = 0; 
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(ii) for j' > j and k' G fCj' , either Cj/^w ^ Cj^k or jiiCj'^k' H Cj^k) = 0; 

(iii) for j < j' and k' G fCj' , there exists a unique k G fCj such that Cj/^w ^ Cj^u; 

(iv) each Cj^k contains a point cj^k such that B-^^^_j{cj^k) ^ Cj^k ^ (<^j,/c) ? 
for a constant ci depending on intrinsic geometric properties of M.. In 
particular, we have ii{Cj^k) ^ . 

The construction of such tree decompositions is possible on spaces of homoge- 
neous type [511 [521 [53] . Let T be the tree structure associated to the decomposition 
above: for any j G Z and /c G /Cj, we let children(jf, k) = {k' G /Cj+i : Cj-^i^k' ^ Cj,k}- 
Note that Cj^k is the disjoint union of its children Cj+i^fc/, k' G children(j, /c), due to 
(ii). We assume that /i(A^) ~ 1 such that there is only one cell at the root of the tree 
with scale log2d l~i{M.) = (thus we will only consider j > 0). For every x G A^, 
with abuse of notation we use (j, x) to represent the unique (j, /c(x)), /c(x) G JCj 
such that X G Cj^^^y The family of dyadic cells {Cj^k}keK:j at scale j generates 
a cr-algebra Tj. Functions measurable with respect to this a-algebra are piecewise 
constant on each cell. 

In this paper we will construct dyadic cells on i.i.d. /i-distributed samples {xi}^^i 
from A4 according to the following variation of the construction of diffusion maps 
[H [5l]: we connect each Xi to its /c-nearest neighbors (default value is k = 50), 
with weights Wij = K{xi^Xj) = e""^^"^-?" /^i^j ^ where is the distance between 
Xi and its A:/2-nearest neighbor, to obtain a weighted graph on the samples Xi (this 
construction is used and motivated in [55]). We then make use of METIS [56 to 
produce the multi-scale partitions {Cj^k} and the dyadic tree T above. In a future 
publication we will discuss how to use a variation of cover trees [57 , which has 
guarantees in terms of both the quality of the decomposition and computational 
costs, and has the additional advantage of being easily updatable with new samples. 

We may also construct the cells Cj^k by intersecting Euclidean dyadic cubes in 
with A4: if is sufficiently regular and so is its embedding in (e.g. A4 
a smooth compact isometrically embedded manifold, or a dense set of samples, 
distributed according to volume measure, from it), then the properties in Definition 
12. H are satisfied for j large enough. In this case, a careful numerical implementation 
is needed in order to not be penalized by the ambient dimensionality (e.g. [58j and 
references therein). 

Definition 2.2. We define 

(2.1) D{M) = {y eR^ :3\ X e M such that\\x - y\ \ = min \\x -y\\}, 

x'eM 

(2.2) tuhr{M) = {yeR^: d{y, M) < r} 
and, following H. Federer [59\, 

(2.3) reach(M) = sup{r > : tub^(M) C D{M)} . 
For X G reach(A^), let be the point in A4 closest to x. 

One may think of reach(AI) as the largest radius of a non-self-intersecting tube 
around A1, which depends on the embedding of M in R^. This notion has appeared 
under different names, such as "condition number of a manifold" , in recent manifold 
learning literature [601 EB ^ as a key measure of the complexity of M. embedded in 
R^. In our setting, we require positive reach(A^) only in order to obtain uniform 
estimates, but for local (or pointwise) estimates only require reach(5^(r)), or 
reach(A^ nB^(r)), for all r's sufficiently small (depending on z). 
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2.2. Multiscale singular value decompositions and geometric scaling func- 
tions. The tools we build upon are classical in multi-scale geometric measure the- 
ory [62] [BS*, "SS^ , especially in its intersection with harmonic analysis, and it is also 
related to adaptive approximation in high dimensions, see for example [45, 46, and 
references therein. An introduction to the use of such ideas for the estimation of 
intrinsic dimension of point clouds is in [8 and references therein (see [71 [64] for 
previous short accounts). 

We will associate several gadgets to each dyadic cell Cj^k^ starting with some 
geometric objects: the mean 

(2.4) c,, := E^[x\x e C,, J = — / xd^{x) e 

and the covariance operator restricted to Cj^k 

(2.5) cov,, = E^[(x - Cj^k){x - c,, fe)*k ^ ^ ^^""^ • 

Here and in what follows points in are identified with D-dimensional column 
vectors. For a prescribed dj^k (e.g. dj^k = d), let the rank-dj^/c Singular Value 
Decomposition (SVD) [65j of covj^^ be 

(2.6) coYj^k ^ ^j,k^j,k^lk^ 

where ^j^k is an orthonormal D x dj^k matrix and S is a diagonal dj^k x dj^k matrix. 
The linear projection operator onto the subspace {^j,k) spanned by the columns of 
^j^k will be denoted by Pj^k- We let 

(2.7) Yj,k := Vj^k + Cj^k , Vj^k = {^j,k) , 

where (A) denotes the span of the columns of A, so that Yj^k is the affine subspace 
of dimension dj^k parallel to Vj^k and passing through cj^k- It is an approximate 
tangent space to M at location cj^k and scale 2~^; and in fact it provides the best 
(ij^/e- dimensional planar approximation to A4 in the least square sense: 



(2.8) V,, = argmin / \\x - ¥u{x)\\' d^{x) , 

where 11 is taken on the set of all affine dj^/c-planes, and Pn is the orthogonal projec- 
tion onto the affine plane 11. We think of {^j^k}keiCj as the geometric analogue of a 
family of scaling functions at scale j, and therefore call geometric scaling functions. 
Let Pj^/c be the associated affine projection 

(2.9) Pj,fe(^) := Pj,k{x - Cj^k) + Cj^k = ^j,k^lk{x - Cj^k) + c^-,/c, x G Cj^k • 

Then Pj^/c(C'j,/c) is the projection of Cj^k onto its local linear approximation, at 
least for < reach(AI). 
We let 

(2.10) Mj := {Fj^k{Cj,k)}ke)C, 

be a coarse approximation of M. at scale j, the geometric analogue to what the 
projection of a function onto a scaling function subspace is in wavelet theory. Under 
general conditions, A4j A4 in the Hausdorff distance, as j ^ +oo. It is natural 
to define the nonlinear projection of A4 onto A4j by 

(2.11) Xj=PMAx):=rj^kix) , xeCj^k- 
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Figure 1. An illustration of the geometric wavelet decomposition. 
The centers Cj^x^s are represented as lying on A4 while in fact they 
are only close (to second order) to A^, and the corresponding planes 
Yj^x are represented as tangent planes, albeit they are only an 
approximation to them. Art by E. Monson. 



2.3. Geometric wavelets. We would like to efficiently encode the difference needed 
to go from A4j to for j > 0. Fix x G A4: the difference Xj-^i — Xj is a high- 

dimensional vector in R^, in general not contained in A^j+i. However it may be 
decomposed into a sum of vectors in certain well-chosen low-dimensional spaces, 
which are shared across multiple points, in a multi-scale fashion. Recall that we 
use the notation (j, x) to denote the unique pair (j, /c), with k G /Cj, such that 
X e Cj^k- We proceed as follows: for j < J — 1 we let 

QMj+i{x) := Xj^i - Xj 

= - P,, + (P,, - P,,^(x)) 

= (/ - Pj^x){Xj^l - Cj^x) + Pj^x{Xjj^i - X) 

(2.12) = - Pj,x){Xjj^i - Cj^i^x + - Cj^x) - Pj,x{x - Xj^i). 

Let Wj-^i^x be the geometric wavelet subspace defined by 

(2.13) Wj^i^x := (/ - Pj,x) Vj^i^x , 

"^j-^i^x an orthonormal basis for Wj-^i^x^ that we will call a geometric wavelet ba- 
sis, and Qj-\-i^x the orthogonal projection onto Wj-^i^x- Clearly dimWj-^i^x ^ 
dimV^+i^a; = dj-^i^x- If we define the quantities 

(2.14) tj-^i^x ' ^j-\-i,x ^j^x") 

(2.15) Wj-^i^x •= (^ — Pj,x) tj-\-i,x'', 

(2.16) Qj+i,x{x) := Qj^i,x{x - Cj^i^x) + , 
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then we may rewrite (j2.12p as 

J-1 



J-1 

E 

i=j+i 



+ >J {xi^i - Xi) 



J-1 

(2.17) = Q^-+i,^(x^-+i) - P^-,a, ^ Qmi^A^) - PjA^ - ^j)' 

Here J > j + 1 is the index of the finest scale (and the last term vanishes as 
J +00, under general conditions). In terms of the geometric scaling functions 
and wavelets, the above may be written as 

J-1 

Xj^l - Xj = ^^•+1,^^*+!^^ {Xj^i - Cj^iA + ^j+l,x - ^j,x^lx Yl + 

l=j+l 

(2.18) (x-xj). 

This equation splits the difference ^j+i — Xj into a component in Wj-^i^x^ a second 
component that only depends on the cell (j + 1, x) (but not on the point x per se)^ 
accounting for the translation of centers and lying in the orthogonal complement 
of Vj^x but not necessarily in Wj-\-i^x^ and a sum of terms which are projections on 
Vj^x of differences in the same form xi-^i — xi^ but at finer scales. By construction 
we have the two-scale equation 

(2.19) PM,+^ (x) = Pm, {x) + Qm.+i (^) , xeM 

which can be iterated across scales, leading to a multi-scale decomposition along 
low-dimensional subspaces, with efficient encoding and algorithms. We think of 
Pj^k as being attached to the node (j, /c) of T, and the Qj+i^^/ as being attached 
to the edge connecting the node (j + 1, k') to its parent. 

We say that the set of multi-scale piecewise affine operators {Pmj } and {Qmj+i } 
form a Geometric Multi- Resolution Analysis^ or GMRA for short. 

2.4. Approximation for manifolds. We analyze the error of approximation to 
a d-dimensional manifold in by using geometric wavelets representation. The 
following result fully explans of the examples in Sec. 14. 1[ 

Theorem 2.3. Let (A^, p, ji) he a compact C^^^ Riemannian manifold of dimension 
d isometrically embedded in , with a G (0, 1]; and fi absolutely continuous with 
respect to the volume measure on M. Let {Pmj , Qmj+i } be a GMRA for (A1, p, /i). 
For any x e there exists a scale jo = jo{x) such that for any j > jo and any 
p> 0, if we let djij^x •= lJi{CjA~^dii, 



lR^llLP(Cj,,,d/i,,x(2)) 



z - Pm,, {z) 



J-1 

l=jo 



(2.20) < ||k||l=o(c,,j 2-(i+")^' + o(2-(i+")^') . 

If a < 1, k{x) depends on the C^~^°^ norm of a coordinate chart from Tx{Ai) to 
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If a = 1, n{x) = min(/^i(x), n2{x)) ^with 



(2.21) 



ni{x) := - _ max \\Hi{x)\\] 

2 ie{l,...,D-d} 

(2.22) 



d{d^l) 



es^-d 4(d + 2)((i + 4) 



D-d 



^ wiHi{x) 



1=1 



m-d 



d^2 , 

F \ ^=1 



anc? the D — d matrices Hi{x) are the d-dimensional Hessians of M at x. 



This theorem describes the asymptotic decay of the geometric wavelet coeffi- 
cients as a function of scale, and in particular it implies the compressibility of such 
coefficients. The decay depends on the smoothness of the manifold, and for 
manifolds it is quadratic in the scale; it saturates at C^, and for smoother mani- 
folds we would have to use higher order geometric wavelets. We do not consider 
them here as the data sets we consider do not seem to benefit from higher order 
constructions. More quantitatively, the asymptotic rate is affected by the constant 
which combines the distortion of dfi compared to the volume measure, and a 
notion of curvature. Depending on the size of which in general varies from 
location to location, it gives an error estimate for an adaptive thresholding scheme 
that would threshold small coefficients in the geometric wavelet expansion (see the 
third example in Section 14. ip . 

Observe that 1^2 can be smaller than ki (by a constant factor) or larger (by 
factors depending on (i^), depending on the spectral properties and commutativity 
relations between the Hessians Hi. ^2 may be unexpectedly small, in the sense 
that it may scale as d~'^r^ as a function of d and r, as observed in [8j, because of 
concentration of measure phenomena. 

Finally, we note that similar bounds may be obtained in L^(Cj^x-,dYo\) simply 
by changing measure from dji to dvol and paying the price of replacing the con- 
stant K by 



dvol 



Hi. This may also be achieved algorithmically with simple 

standard renormalizations (e.g. 

The proof is postponed to the Appendix. 

It is clear how to generalize the Theorem to unions of manifolds with generic 
intersections, at scales small enough around a point so that Cj^x does not include 
intersections. Moreover, since the results are local, sets more general than manifolds 
may be considered as well: this is subject of a future report. 



2.5. Non- manifold data and measures of approximation error. When con- 
structing a GMRA for point-cloud data not sampled from manifolds, we may choose 
the dimension dj^k of the local linear approximating plane Vj,/c by a criterion based 
on local approximation errors. Note that this affects neither the construction of 
geometric scaling functions, nor that of the wavelet subspaces and bases. 
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A simple measure for absolute error of approximation at scale j is: 

\\PMj{x) -X\\lnd^{x) = / \\Pj^k{x) - X\\ln dfl\c^^,,{x) 

(2.23) = ^ /i(Q,fc) Mcoyj,k)^ 

We can therefore control Ej by choosing dj^k based on the spectrum of covj^k- If we 
perform relative thresholding of covj^ki i-e. choose the smallest dj^k for which 

(2.24) ^ Xi{covj^k) < €j ^ A/(cov^-,fc), 

for some choice of ej (e.g. cj = (cO^) V e for some G (0, 1) and e > 0), then we 
may upper bound the above as follows: 

(2.25) E KCM\Cj,k\\l < ^j\\\M\\\f . 

where Cj^k and M are thought of as matrices containing points in columns, and 
for a partitioned matrix A = [^41,742, . . . , A^] and discrete probability measure /i 
on {1, . . . , r} we define 

(2.26) |||A|||^=^MW)II^.|If- 

i=l 

If we perform absolute thresholding of covj^/c, i.e. choose the smallest dj^k for which 
X]^>d fc+i ^i{^^^j,k) ^ then we have the rough bound 

(2.27) £2 < J2 ^i{Cj,k)ej < ej ■ ^i{M). 

Of course, in the case of a d-dimensional manifold A4 with volume measure, 
if we choose dj^k = d^ by Theorem 12. 31 we have 

(2.28) < J2 MQ,fe)ll^lloo2-2^' ^ ^^M)M^2-^^. 

3. Algorithms 

We present in this section algorithms implementing the construction of the 
GMRA and the corresponding Geometric Wavelet Transform (GWT). 

3.1. Construction of Geometric Multi- Resolution Analysis. The first step 
in the construction of the geometric wavelets is to perform a geometric nested 
partition of the data set, forming a tree structure. For this end, one may consider 
various methods listed below: 

(I). Use of METIS [56]: a multiscale variation of iterative spectral partitioning. 
We construct a weighted graph as done for the construction of diffusion 
maps in [54]: we add an edge between each data point and its k near- 
est neighbors, and assign to any such edge between Xi and Xj the weight 
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GMRA = GeometricMultiResolutionAnalysis (Xn,ro,e) 
1 1 Input: 

/ / Xn : a set of n samples from J\A 

/ / tq\ some method for choosing local dimensions 

// e: precision 

// Output: 

//A tree T of dyadic cells {Cj,/c}, their local means {cj^k} and bases 
/ / together with a family of geometric wavelets {wj^k} 

Construct the dyadic cells Cj^k with centers {cj,^} and form a tree T. 
J ^ finest scale with the e-approximation property. 

Let covj,fc = \Cj,k\~^^^^Cj ~ cj,k){x - cj^kY , for k G /Cj, and compute 
SVD(covj,fc) = (where the dimension of ^j^k is determined by ro). 

for j — J — 1 down to 
for k G K,j 

Compute COY j^k and as above. 

For each k^ G children ( j, /c), construct the wavelet bases ^^+1,^/ and 
translations Wj^i^k', according to (|2.16p . (|2.13p . 

end 

end 

For convenience, set ^o,fe := ^o,fc and wo^k := co,/c for /c G /Co. 



Figure 2. Pseudo-code for the construction of geometric wavelets 

^-\\xi-xj\\'^ /a ^ Here k and a are parameters whose selection we do not dis- 
cuss here (but see [55 for a discussion in the context of molecular dynamics 
data). In practice, we choose k between 10 and 50, and choose a adaptively 
at each point Xi as the distance between Xi and its [k/2\ nearest neighbor. 
(II). Use of cover trees [57] , 

(III) . Use of iterated PC A: at scale 1, compute the top d principal components 

of data, and partition the data based on the sign of the {d + l)-st singular 
vector. Repeat on each of the two partitions. 

(IV) . Iterated /c-means: at scale 1 partition the data based on /c-means clustering, 

then iterate on each of the elements of the partition. 

Each construction has pros and cons, in terms of performance and guarantees. For 
(I) we refer the reader to [56 , for (II) to [57| (which also discussed several other 
constructions), for (III) and (IV) to [66\ Only (II) guarantees the needed properties 
for the cells Cj^k- However constructed, we denote by {Cj^k} the family of resulting 
dyadic cells, and let T be the associated tree structure, as in Definition 12. li 

In Fig. [2] we display pseudo-code for the construction of a GMRA for a data 
set Xn given a precision e > and a method tq for choosing local dimensions 
(e.g., using thresholds or a fixed dimension). The code first constructs a family of 
multi-scale dyadic cells (with local centers Cj^k and bases *^*j,/c), and then computes 
the geometric wavelets ^j^^ and translations Wj^k at all scales. In practice, we use 
METIS [56| to construct a dyadic (not 2^-adic) tree T and the associated cells Cj^k- 

3.2. The Fast Geometric Wavelet Transform and its Inverse. For simplicity 
of presentation, we shall assume x = xj; otherwise, we may first project x onto the 
local linear approximation of the cell Cj^x and use xj instead of x from now on. 
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{qj^^} =FGWT(GMRA,x) 

// Input: GMRA structure, x e M 

II Output: A sequence of wavelet coefficients 

for j — J down to 1 

Pj-l,x = {^*j-l,x^J,x)pj,x + ^*j-l,a:{cj,a: -C^-l,x) 

end 

qo,x = Po,x (for convenience) 



Figure 3. Pseudo-code for tfie Forward Geometric Wavelet Transform 
X =IGWT(GMRA,{(^^>}) 

// Input: GMRA structure, wavelet coefficients {qj,x} 
II Output: Approximation x at scale J 

Qj,x = ^J,xqj,x + Wj,x 

for j = J — 1 down to 1 

Qj(x) = ^j,xqj,x + + Y.i>j Qd^) 

end 

X = '^o,xqo,x + wo,x + Z^^>o Qj(^) 



Figure 4. Pseudo-code for the Inverse Geometric Wavelet Transform 

That is, we will define Xj-j = PMj{xj)^ for all j < J, and encode the differences 
Xj-^i-j — Xj.j using the geometric wavelets. Note also that \\xj.j — Xj\\ < \\x — xj\\ 
at all scales. 

The geometric scaling and wavelet coefficients {Pj,x}^ {Qj-\-i,x}^ foi" j > 0, of a 
point X G A4 are chosen to satisfy the equations 

(3.1) PMj (x) = ^j,xPj,x + Cj^x] 

J-1 

(3.2) Qmj+i {x) = "^j^i^xqj+i^x + Wj^i^x - Pj^x ^ {x). 

The computation of the coefficients, from fine to coarse, is simple and fast: since 
we assume x = xj, we have 

(3.3) = {^Ix^J.x) PJ,x + ^lx{cj,x - Cj^x)- 

Moreover the wavelet coefficients ^j+i,^ (defined in (|3.2p ) are obtained from (|2.18p : 

(3.4) qj+i^x = \E^*+i,^(x^-+i - Cj^i,x) = (^*+i,^<&^-+i,x) Pi+i,x. 

Note that ^j^^j^x and are both small matrices (at most dj^x x dj^x)^ 

and are the only matrices we need to compute and store (once for all, and only up 
to a specified precision) in order to compute all the wavelet coefficients <7j+i,x and 
the scaling coefficients pj^x^ given pj^x at the finest scale. 

In Figs. [3] and [H we display pseudo-codes for the computation of the Forward 
and Inverse Geometric Wavelet Transforms (F/IGWT). The input to FGWT is a 
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Figure 5. Toy data sets for geometric wavelets transform. 



GMRA object, as returned by GeometricMultiResolutionAnalysis, and a point 
X G A4. Its output is the wavelet coefficients of the point x at all scales, which are 
then used by IGWT for reconstruction of the point at all scales. 
For any x G A^j, the set of coefficients 



(3.5) 



Qx = {Qj,x'-, Qj-i^x', • • • ; Qi,x'-,Po,x) 



is called the discrete geometric wavelet transform (GWT) of x. Letting dj.^ = 
rank(^j+i^aj), the length of the transform is d -\- Xl^^o^j^a^' which is bounded by 
{J -\-l)d in the case of samples from a d-dimensional manifold (due to dj^j, < d). 

Remark 3.1. Note that for the variation of the GMRA without adding tangential 
corrections (see Sec. 16. 2p . the algorithms above (as well as those in Sec. [5j) can be 
simplified. First, in Fig.[2]we will not need to store the local bases functions {^j^k}- 
Second, the steps in Figs. [3] and H] can be modified not to involve similarly 
as in Figs. [T71 and [T8l of next section. 



4. Examples 

We conduct numerical experiments in this section to demonstrate the perfor- 
mance of the algorithm (i.e.. Figs. [21 [3l |4|). 

4.1. Low-dimensional smooth manifolds. To illustrate the construction pre- 
sented so far, we consider simple synthetic datasets: a SwissRoll^ an S-Manifold 
and an Os dilating 2D Wave^ all two-dimensional manifolds but embedded in R^^ 
(see Fig. [5]). We apply the algorithm to construct the GMRA and obtain the for- 
ward geometric wavelet transform of the sampled data (10000 points, without noise) 
in Fig.[6l We use the manifold dimension dj^k = <i = 2 at each node of the tree when 
constructing scaling functions, and choose the smallest finest scale for achieving an 
absolute precision .001 in each case. We compute the average magnitude of the 
wavelet coefficients at each scale and plot it as a function of scale in Fig. [6l The 
reconstructed manifolds obtained by the inverse geometric wavelets transform (at 
selected scales) are shown in Fig. [71 together with a plot of relative approximation 
errors. 



(4.1) 



crel 



\ 



n ^ 

xeXn 



Pj,x{x) 



\x\\ 



where Xn is the training data of n samples. Both the approximation error and the 
magnitude of the wavelet coefficients decrease quadratically with respect to scale 
as expected. 
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Figure 6. Top row: Wavelet coefficients obtained by the algo- 
rithm for the three data sets in Fig. [5l The horizontal axis in- 
dexes the points (arranged according to the tree), and the vertical 
axis multi-indexes the wavelet coefficients, from coarse (top) to 
fine (bottom) scales: the block of entries (x, j),x G Cj^k displays 
logio \Qj,x\^ where qj^x is the vector of geometric wavelet coefficients 
of X at scale j (see Sec. [3]). In particular, each row indexes multiple 
wavelet elements, one for each k G JCj. Bottom row: magnitude of 
wavelet coefficients decreasing quadratically as a function of scale. 

We threshold the wavelet coefficients to study the compressibility of the wavelet 
coefficients and the rate of change of the approximation errors (using compressed 
wavelet coefficients). For this end, we use a smaller precision 10~^ so that the 
algorithm can examine a larger interval of thresholds. We ffist threshold the wavelet 
coefficients of the Oscillating 2D Wave data at the level .01 and plot in Fig. [8] the 
reduced matrix of wavelet coefficients and the corresponding best reconstruction of 
the manifold (i.e., at the finest scale). Next, we threshold the wavelet coefficients 
of all three data sets at different levels (from 10~^ to 1) and plot in Fig. [9] the 
compression and error curves. 

4.2. Real data. 

4.2.1. MNIST Handwritten Digits. We first consider the MNIST data set of images 
of handwritten digital, each of size 28 x 28. We use the digits and 1, and randomly 
sample for each digit 3000 images from the database. Fig. [10] displays a small subset 
of the sample images of the two digits, as well as all 6000 sample images projected 
onto the top three PCA dimensions. We apply the algorithm to construct the 
geometric wavelets and show the wavelet coefficients and the reconstruction errors 
at all scales in Fig. [TTJ We select local dimensions for scaling functions by keeping 
50% and 95% of the variance, respectively, at the nonleaf and leaf nodes. We 
observe that the magnitudes of the coefficients stops decaying after a certain scale. 
This indicates that the data is not on a smooth manifold. We expect optimization 



Available at |http: //yann. lecun. com/exdb/mnist/ . | 
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scale = 4, error = 0.089256 scale = 4, error = 0.093612 ^ scale = 5, error = 0.20783 




Figure 7. Top and Middle: Reconstructions by the algorithm of 
the three toy data sets in Fig. [5] at two selected scales. Bottom: 
Reconstruction errors as a function of scale. 



Magnitude of Wavelet Coefficients (log1 scale) i scale = 11, error = 0.01 699 




Figure 8. We threshold the wavelet coefficients of the Oscillat- 
ing2DWave data at the level of .01 and prune the dyadic tree 
accordingly. The figure, from left to right, respectively shows the 
reduced matrix of wavelet coefficients (only their magnitudes), and 
the corresponding best approximation of the manifold. 

of the tree and of the wavelet dimensions in future work to lead to a more efficient 
representation in this case. 

We then fix a data point (or equivalently an image), for each digit, and show in 
Fig. [12] its reconstructed coordinates at all scales and the corresponding dictionary 



16 



WILLIAM K. ALLARD, GUANGLIANG CHEN, AND MAURO MAGGIONI 




Figure 9. Left: the compression ratio of the matrix of the wavelet 
coefficients shown in Fig. [6l Right: the corresponding approxima- 
tion errors. The linearity is consistent with Theorem 12. 3[ and 
essentially says that thresholding level S generates approximation 
errors of order at most 0{6). 
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Figure 10. Some examples of the MNIST digits 1 and (left) 
and 6000 sample images shown in top three PCA dimensions (right) 

elements (all of which are also images). We see that at every scale we have a 
handwritten digit, which is an approximation to the fixed image, and those digits 
are refined successively to approximate the original data point. The elements of the 
dictionary quickly fix the orientation and the thickness, and then they add other 
distinguishing features of the image being approximated. 

4.2.2. Human Face Images. We consider the cropped face images in both the Yale 
Face Database Hi and the Extended Yale Face Database which are available for 
38 human subjects each seen in frontal pose and under 64 illumination conditions. 
(Note that the original images have large background variations, sometimes even 
for one fixed human subject, so we decide not to use them and solely focus on the 
faces.) Among these 2432 images, 18 of them are corrupted, which we discard. 
Fig. [m displays a random subset of the 2414 face images. Since the images have 
large size (192 x 168), to reduce computational complexity we first project the 

^http :77cvc . yale . edu/projects/yalef acesB/yalef acesB .html 
^http : //vision . ucsd . edu/~ leekc/ExtYaleDatabase/ExtYaleB . htrnl] 
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Figure 11. Top left: geometric wavelet representation of the 
MNIST digits 1 and 0. As usual, the vertical axis multi-indexes the 
wavelet coefficients, from coarse (top) to fine (bottom) scales: the 
block of entries at (x, j),x G Cj^k is logio l^oA^ where qj^x is the 
vector of geometric wavelet coefficients of x at scale j (see Sec. [3]). 
In particular, each row indexes multiple wavelet elements, one for 
each k e JCj. Top right: dimensions of the wavelet subspaces (with 
the same convention as in the previous plot). Bottom: magnitude 
of coefficients (left) and reconstruction error (right) as functions of 
scale. The red lines are fitted omitting the first and last points (in 
each plot) in order to more closely approximate the linear part of 
the curve. 



images into the first 500 dimensions by SVD, keeping about 99.5% variance. We 
apply the algorithm to the compressed data to construct the geometric wavelets 
and show the wavelet coefficients, dimensions and reconstruction errors at all scales 
in Fig. [T4j Again, we have kept 50% and 95% of the variance, respectively, at the 
nonleaf and leaf nodes when constructing scaling functions. Note that both the 
magnitudes of the wavelet coefficients and the approximation errors have similar 
patterns with those for the MNIST digits (see Fig. [11]), indicating again a lack of 
manifold structure in this data set. We also fix an image and show in Fig. [15] its 
reconstructed coordinates at all scales and the corresponding wavelet bases (all of 
which are also images). 
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Figure 12. Left column: in each figure we plot coarse-to-fine geo- 
metric wavelet approximations of the original data point (repre- 
sented in the last image). Right column: elements of the wavelet 
dictionary (ordered from coarsest to finest scales) used in the ex- 
pansion on the left. 




Figure 13. Left: A random subset of the 2414 face images (38 
human subjects in frontal pose under 64 illumination conditions); 
Right: the entire data set shown in top three PC A dimensions. 



5. Orthogonal Geometric Multi-Resolution Analsysis 

Neither the vectors QMj+i{x), nor any of the terms that comprise them, are 
general orthogonal across scales. On the one hand, this is natural since A4 is 
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Figure 14. Top left: magnitudes of the wavelet coefficients of the 
cropped faces (2414 images) arranged in a tree. Top right: dimen- 
sions of the wavelet subspaces. Bottom: magnitude of coefficients 
(left) and reconstruction error (right) as functions of scale. The 
red hues are fitted omitting the first and last points (in each plot) 
in order to more closely approximate the linear part of the curve. 



nonlinear, and the lack of orthogonality here is a consequence of that. On the 
other hand, the QMj+ii^) be almost parallel across scales or, for example, 
the subspaces Wj-^i^x share directions across scales. If that was the case, 

we could more efficiently encode the dictionary by not encoding shared directions 
twice. A different construction of geometric wavelets achieves this. We describe 
this modification with a coarse-to-fine algorithm, which seems most natural. We 
start at scales and 1, letting 

(5.1) So,x = Vo,x , = So,x © Wi^a: , Ui^x = Wi^a:, 

and for j > 1, 

(5.2) Uj-^i^x = ^sf^ {Wj-^i^x) , ^j-\-i,x = Sj^x © Ujj^i^x 

Observe that the sequence of subspaces Sj^x is increasing: Sq^x ^ Si^x ^ • • • ^ 
Sj^x ^ • • • and the subspace Uj j^i^x is exactly the orthogonal complement of Sj^x 
into Sj^i^x' This is a situation analogous to that of classical wavelet theory. Also, 
we may write 



(5.3) 
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Figure 15. Left: in images 1-9 we plot coarse-to-fine geomet- 
ric wavelet approximations of the projection and the original data 
point (represented in the last two images). Right: elements of the 
wavelet dictionary (ordered from coarse to fine in 1-9) used in the 
expansion on the left. 



where the direct sum is orthogonal. At each scale j we do not need to construct 
a new wavelet basis for each Wj-^i^x^ but we only need to construct a new basis 
for Uj-^i^x^ and express Qj-\-i^x{x) in terms of this new basis, and the wavelet and 
scaling function bases constructed at the previous scales. This reduces the cost 
of encoding the wavelet dictionary as soon as dim{Uj-^i^x) < dim{Wj-^i^x) which, 
as we shall see, may occur in both artificial and real world examples. From a 
geometrical perspective, this roughly corresponds to the normal space to at a 
point not varying much at fine scales. 

Finally, we note that we can define new projections of a point x into these 
subspaces Sj^x' 

(5.4) Sj^x = PSj,^ {X - Cj^x) + Cj^x- 

Note that since Vj^x ^ Sj^x^ Sj^x is a better approximation than xj to x at scale j 
(in the least squares sense). Also, 

(5.5) Sj^i^x - Sj^x = Uj^i^xUj^i^^{x - Cj^i^x) + (^ - PSj,J{Cj^i^x - Cj^x)' 

We display in Figs. [161 [13 [181 pseudo-codes for the orthogonal GMRA and the 
corresponding forward and inverse transforms. The reader may want to compare 
with the corresponding routines for the regular GMRA construction, displayed in 
Figs.[2l[3l[4l Note that as the name suggests, the wavelet bases "^j^k along any path 
down the tree are mutually orthogonal. Moreover, the local scaling function at each 
node of such a path is effectively the union of the wavelet bases of the node itself 
and its ancestors. Therefore, the Orthogonal GMRA tree will have small height if 
the data set has a globally low dimensional structure, i.e., there is small number of 
normal directions in which the manifold curves. 

Example: A connection to Fourier analysis. Suppose we consider the classical 
space of band-limited functions of band B: 

(5.6) BFb = {/ : supp. / C [-Btt, Btt]} . 

It is well-known that classical classes of smooth functions (e.g. W^'^) are charac- 
terized by their L^-energy in dyadic spectral bands of the form [— 2-^+^7r, —2^7r] U 
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OrthoGMRA = OrthogonalGMRA (Xn,ro,e) 
1 1 Input: 

// Xn'. a set of n samples from J\A 

/ / To: some method for choosing local dimensions 

// e: precision 

// Output: 

//A tree T of dyadic cells {Cj^k} with their local means {cj,A;}, and a family of 
orthogonal geometric wavelets {C/j,^}, and corresponding translations {wj^k} 

Construct the cells Cj^k^ and form a dyadic tree T with local centers Cj^k- 

Let covo,fc = \CoA~^Y^x^Cq ~ co,fc)(x - co,fc)*, for k G /Co, and compute 

SVD(covo,A;) = ^o,k^o,k^o,k (whcrc the dimension of $o,A; is determined by tq). 

Set j = and ^o,A; ^o,k,wo^k '= co,jfc 

Let J be the maximum scale of the tree 

while j < J 

for k G JCj 

o<i<j be the union of all wavelet bases of the cell Cj^k 
and its ancestors. If the subspace spanned by ^^j^k^^ approximate the 
cell within the given precision e, then remove all the offspring of Cj^k from 
the tree. Otherwise, do the following. 

Compute covj+i^/g/ and ^j^i^k'^ for all k' G children(j, A;), as above 
For each k' G children ( j, fc), construct the wavelet bases Uj^i^y sis the 
complement of ^jj^i^y in The translation Wj^i^y is the projec- 

tion of Cjj^i^k' — Cj^k into the space orthogonal to that spanned by the 

j,k 

end 

J = J + 1 

end 



Figure 16. Pseudo-code for the construction of an Orthogonal 
Geometric Mult i- Resolution Analysis. 



{qj^x} =orthoFGWT(orthoGMRA, x) 

II Input: orthoGMRA structure, x G 

// Output: A sequence {(lj,x) of wavelet coefficients 

r = X 

for j — J down to 

r = r - (Uj^x ' qj,x + wj^x) 

end 



Figure 17. Pseudo-code for the Forward Orthogonal Geometric 
Wavelet Transform 

[2-^7r, 2-^+-^7r], i.e. by the L^-size of their projection onto BF2j+i © BF2j (some care 
is in fact needed in smoothing these frequency cutoffs, but this issue is not rele- 
vant for our purposes here). If we observe samples from such smoothness spaces, 
which kind of dictionary would result from our GMRA construction? We consider 
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X =orthoIGWT(orthoGMRA,{Q^,a,}) 

// Input: orthoGMRA structure, wavelet coefficients {qj,x} 
1 1 Output: Approximation x at scale J 

X = 

for j = to J 
end 



Figure 18. Pseudo-code for the Inverse Orthogonal Geometric 
Wavelet Transform 




Figure 19. We construct an Orthogonal Geometric Multi- 
Resolution Analysis (see Sec. [5]) on a random sample of 10000 
band- limited functions. Left: dimension of the GMRA wavelet 
subspaces. Center: approximation error as a function of scale. 
Right: dominant frequency in each GMRA subspace, showing that 
frequencies are sorted from low (top, coarse GMRA scales) to high 
(bottom, fine GMRA scales). This implies that the geometric scal- 
ing function subspaces roughly corresponds to a Littlewood-Paley 
decomposition, and the GWT of a function / corresponds to a 
rough standard wavelet transform. 

the following example: we generate random smooth (band-limited!) functions as 
follows: 

J 

(5.7) fuj{x) = ^aj{uj)cos{jx) 

3=0 

with aj random Gaussian (or bounded) with mean 2~LtJ" and standard deviation 
2~ L J J " . i . These functions are smooth and have comparable norms in a wide variety 
of smoothness spaces, e.g. W^'^, so that they may thought of as approximately 
random samples from the unit ball in such space, intersected with band-limited 
functions. We construct a GMRA on a random sample from this family of functions 
and see that it organizes this family of functions in a Littlewood-Paley type of 
decomposition: the scaling function subspace at scale j roughly corresponds to 
BF2j+i BF23 ^ and the GMRA of a point is essentially a block Fourier transform, 
where coefficients in the same dyadic band are grouped together. This is as expected 
since the geometry of this data set is that of an ellipsoid with axes of equal length 
in each dyadic frequency band, and decreasing length as j increases. It follows that 
the coefficients in the FGWT of a function / measure the energy of / in dyadic 
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bands in frequency, and is therefore an approximate FFT of sorts. Finally, observe 
that the cost of the FGWT of a point / is comparable to the cost of the Fast Fourier 
Transform. 

6. VARIATIONS, GREEDY ALGORITHMS, AND OPTIMIZATIONS 

We discuss several techniques for reducing the encoding cost of the geometric 
wavelet dictionary and/or speeding up the decay of the geometric wavelet coeffi- 
cients. 

6.1. Splitting of the wavelet subpaces. Fix a cell Cj^k- For any G children(j, /c), 
we may reduce the cost of encoding the subspace Wj-^i^k' by splitting it into a part 
that depends only on (j, k) and another on (j + 1, /c'): 

(6.1) Wp^j^ := n/e/echildren(j,/e)^j+l,fe' 

and WjYi /c' be the orthogonal complement of Wpj^ in Wj-^i^k'- We may choose 
ort ho normal bases and ^j^i for Wpj^ and W^-^ ^, respectively, and let 
Qf^i k' be the associated orthogonal projections. For the data in Cj+i^/g/, we have 
therefore constructed the geometric wavelet basis 

(6.2) vI/,+i,fc, = [vl/2fc|fj-+i.fe,], 
together with orthogonal splitting of the projector 

(6.3) Qj+i,k' = + Qf+i,k'^ 

where the first term in the right-hand side only depends on the parent (j, /c), and 
the children-dependent information necessary to go from coarse to fine is encoded 
in the second term. This is particularly useful when dim {w^^ is large relative to 
dim(W^+i,/e/). 

6.2. A fine-to-coarse strategy with no tangential corrections. In this varia- 
tion, instead of the sequence of approximations Xj = '^Vj^^ (^) to a point x G A^, we 
will use the sequence Xj = IPy^ (xj+i), for j < J, and xj := xj. The collection A4j 
of Xj for all X G is a coarse approximation to the manifold A4 at scale j. This 
roughly corresponds to considering only the first term in ()2.17p . disregarding the 
tangential corrections. The advantage of this strategy is that the tangent planes 
and the corresponding dictionary of geometric scaling functions do not need to be 
encoded. The disadvantage is that the point Xj does not have the same clear-cut 
interpretation as Xj has, as it is not anymore the orthogonal projection of x onto 
the best (in the least square sense) plane approximating Cj^x- Moreover, xj really 
depends on J: if one starts the transform at a different finest scale, the sequence 
changes. Notwithstanding this, if we choose J so that ||xj — x|| < e, for some 
precision e > 0, then this sequence does provide an efficient multi-scale encoding of 
xj (and thus of x up to precision e). 

The claims above become clear as we derive the equations for the transform: 

(6-4) ' 

= (/ - ^j,x^lx) ((^j+1 - Ci+l,x) + {Cj^l,x - Cj^x)) . 

Noting that Xj+i — Cj-^i^x ^ we obtain 

(6.5) QM.i^j+l) = (^i+l - Cj+l,x) + Wj^i^x^ 
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where ^j+i,^, are the same as in (|2.18p . By definition we still have the 
multi-scale equation 

(6.6) Xj^i = Xj + Q^.^^fe+i) 

for {xj} defined as above. 

6.3. Out-of-sample extension. In many applications it will be important to ex- 
tend the geometric wavelet expansion to points that were not sampled, and/or to 
points that do not lie exactly on A4. For example, A4 may be composed of data 
points satisfying a model, but noise or outliers in the data may not lie on A4. 

Fix X G M^, and let J be the finest scale in the tree. Let cj^x be a closest point 
to X in the net {cj^k}ke)Cj'', such a point is unique if x is close enough to A4. For 
j < we will let (j, be the index of the (unique) cell at scale j that contains 
cj,x- With this definition, we may calculate a geometric wavelet expansion of the 
point Pj^x{x). However, ej{x) := x — Pj^x{x) is large if x is far from M. We may 
encode this difference by greedily projecting it onto the family of linear subspaces 
Wj^x, "-.Wi^x and Vq^x, i.e. by computing 

Qm^,j{x) '= Qj,x{ej{x)), 



(6.7) Qm^A^) •= ^o,x(ej(x) - Qm^,j{A Qm^AA)- 

These projections encode, greedily along the multi-scale "normal" subspaces {QjA- 
The computational complexity of this operation is comparable to that of com- 
puting two sets of wavelet coefficients, plus that of computing the nearest neighbor 
of X among the centers {cjAkeJCj at the finest scale. By precomputing a tree for 
fast nearest neighbor computations, this essentially requires 0(log(|/Cj|)) opera- 
tions. Also, observe that \JCj\ in general does not depend on the number of points 
n, but on the precision in the approximation specified in the tree construction. 

6.4. Spin-cycling: multiple random partitions and trees. Instead of one 
multi-scale partition and one associated tree, in various situations it may be advan- 
tageous to construct multiple multi-scale partitions and corresponding trees. This 
is because a single partition introduces somewhat arbitrary cuts and possible re- 
lated artifacts in the approximation of A^, and in the construction of the geometric 
wavelets in general. Generating multiple partitions or families of approximations 
is a common technique in signal processing. For example, in [67] it is shown that 
denoising by averaging the result of thresholding on multiple shifted copies of the 
Haar system is as optimal (in a suitable asymptotic, minimax sense) as performing 
the same algorithm on a single system of smoother wavelets (and in that paper the 
technique was called spin- cycling). In the study of approximation of metric spaces 
by trees ^ , it is well understood that using a suitable weighted average of metrics 
of suitably constructed trees is much more powerful than using a single tree (this 
may be seen already when trying to find tree metrics approximating the Euclidean 
metric on an interval). 

In our context, it is very natural to consider a family of trees and the associated 
geometric wavelets, and then perform operations on either the union of such geo- 
metric wavelet systems (which would be a generalization of sorts of tight frames, 
in a geometric context), or perform operations on each system independently and 
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then average. In particular, the construction of trees via cover trees [57 is very 
easily randomized, while still guaranteeing that each instance of such trees is well- 
balanced and well-suited for our purposes. We leave a detailed investigation to a 
future publication. 

7. Data representation and compression 

A generic point cloud with n points in can trivially be stored in space Dn. 
If the point cloud lies, up to, say, a least-squares error (relative or absolute) e in a 
linear subspace of dimension de <^ we could encode n points in space 

(7.1) Dde + nde^ =de{D^n), 

cost of cost of encoding 

encoding basis n points 

which is clearly much less than nD. In particular, if the (i- dimensional point cloud 
lies is a d-dimensional subspace, then d^ = d and 

(7.2) d{D + n) . 

Let us compute the cost of encoding with a geometric multi-resolution analysis 
a manifold M of dimension d sampled at n points, and fix a precision e > 0. We 
are interested in the case n ^ +oo. The representation we use is, as in ()2.2Qp : 

J 

(7.3) X xj = Pmo (^) + ^ Qmj (x) , 

where we choose the smallest J such that | |x — 1 1 < e. In the case of a manifold, 
J = log2e~2 because of Theorem 12.31 However, d^ as defined above with global 
SVD may be as large as D in this context, even for d = 1. 

Since M is nonlinear, we expect the cost of encoding a point cloud sampled from 
M to be larger than the cost ()7.2p of encoding a dimensional fiat M; however 
the geometric wavelet encoding is not much more expensive, having a cost: 

(7.4) dD + 2e-^d^ + 2-^^^^ + 2)D + nd{l + log2 e'^) 

V ' V ' 

cost of cost of encoding 

encoding basis n points 

In Sec. 17.21 we compare this cost with that in 17. li on several data sets. To see that 
the cost of the geometric wavelet encoding is as promised, we start by counting the 
geometric wavelet coefficients used in the multi-scale representation. Recall that 
^J,x ~ ^^^^{"^jjx) is the number of wavelet coefficients at scale j for the given point 
X. Clearly, dj^ < d. Then, the geometric wavelet transform of all points takes 
space at most 

J 

(7.5) nd ^^^d^^ < nd ^ ndJ < nd{l + log2 e"^), 

j = l X 

independently of D. The dependency on n, d is near optimal, and this shows that 
data points have a sparse, or rather, compressible, representation in terms of geo- 
metric wavelets. Next we compute the cost of the geometric wavelet dictionary, 
which contains the geometric wavelet bases ^j,/c, translations Wj^ki and cell centers 
Cj^k' If we add the tangential correction term as in ()2.18p . then we should also 
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include the geometric scahng functions ^j^k in the cost. Let us assume for now 
that we do not need the geometric scahng functions. Define 

(7.6) d2, := rank(vl/2fc), 

(7.7) 4+i_fc, :=rank(«'^^+i^fc,) 

and assume that c?^^. < d'~', c^j+i^fc' < d^ for fixed constants d'^,d^ < d. The cost 
of encoding the wavelet bases {'^j,k}keiCj,o<j<j is at most 
j-i 

cost of ^o,fc # cells cost of # cells cost of 

at scale j ^^,k scale j + 1 

odJ _ 1 , 

(7.8) =dD^ Y^^""^ ^ <dD^ 2e-^{d^ + 2-'^d^)D. 

The cost of encoding wj^k^cj^k is 

J 

(7.9) 2 < 2 • 2^^+^ . D = ADe' i . 

Therefore, the over ah cost of the dictionary is 

(7.10) dD + 2e-i{d^ + 2"^^^^ + 2)D. 

In the case that we also need to encode the geometric scaling functions ^j,/c, we 
need an extra cost of 

J 

(7.11) ^2^^dL> < 2e-idD. 

j=o 

7.1. Pruning of the geometric wavelets tree. In this section we discuss how 
to prune the geometric wavelets tree with the goal of minimizing the total cost 
for e-encoding a given data set, i.e., encoding the data within the given precision 
e > 0. Since we are not interested in the intermediate approximations, we will 
adpot the GMRA version without adding the tangential corrections (see Sec. 16. 2p 
and thus there is no need to encode the scaling functions. The encoding cost includes 
both the cost of the dictionary, defined for simplicity as the number of dictionary 
elements {"^j^k^^j^k^cj^k} multiplied by the ambient dimension and the cost 
of the coefficients, defined for simplicity to be the number of nonzero coefficients 
required to reconstruct the data up to precision e. 

7.1.1. Discussion. We fix an arbitrary nonleaf node Cj^k of the partition tree T and 
discuss how to e-encode the local data in Cj^k in order to achieve minimal encoding 
cost. We assume that the data in the children nodes Cj-^i^k'^k^ ^ children(j, /c), 
has been optimally e-encoded by some methods, with scaling functions $^+1^/^/ of 
dimensions <ij+i,/c' and corresponding encoding costs (/Pj+i^fc/. For example, when 
Cj-^i^k' is a leaf node, it can be optimally e-encoded by using a local PGA plane of 
minimal dimension d^^^ with the corresponding encoding cost 

(7.12) ^j+i,k' = n^-+i,/c' • c^^Vi,/c' + D • ^j+i,/c' + 
where Uj j^i^y is the size of this node. 

We consider the following ways of e-encoding the data in Cj^k'- 
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(I) using the existing methods for the children Cj-^i^k' to encode the data in 
Cj^k separately; 

(II) using only the parent node and approximating the local data by a PCA 

plane of minimal dimension (i^- ^ (with basis ^^^/c); 
(III) using a multi-scale structure to encode the data in the node Cj^k, with the 
top djf^ PCA directions being the scaling function at the parent node 
and d'j'_^^ ^, dimensional wavelets encoding differences between ^j+i,/c/ and 
Here, < d^, < dl^. 
We refer to the above methods as children- only encoding, parent- only encod- 
ing and wavelet encoding, respectively. We make the following comments. First, 
method (I) leads to the sparsest coefficients for each point, while method (II) pro- 
duces the smallest dictionary. Second, in method (III), it is possible to use other 
combinations of the PCA directions as the scaling function for the parent, but we 
will not consider those in this paper. Lastly, the children-only and parent-only 
encoding methods can be thought of corresponding to special cases of the wavelet 
encoding method, i.e., when = and d^^^ = d^-^^^ respectively. 

We compare the encoding costs of the three methods above. Suppose there are 
rij^k points in the node Cj^k and n^+i^fc/ points in each Cj+i^/^/, so that Uj^k = 
J2k' ^j+i,k'- When we encode the data in Cj^k with a djj^ dimensional plane, we 
need space 

(7.13) rij^k'dl^ + D-dl^ + D. 

If we use the children nodes to encode the data in Cj,^, the cost is 

k' 

The encoding cost of the wavelet encoding method has a more complex formula, and 
is obtained as follows. Suppose that we put at the parent node a d^^^ dimensional 
scaling function consisting of the top d^j^ principal vectors, where < d^j^ < d^- j^^ 
and that ^ jj^i^k' are the corresponding wavelet bases for the children nodes. Let 
d^]^ > be the dimension of the intersection of the wavelet functions, and write 

dj-^i^k' ~ ^j,/e + ^f-j-i,k" Note that the intersection only needs to be stored once for 
all children. Then the overall encoding cost is 

^7,k = y^j+i,k' - dj+i^k' {nj+i,k' -\-D) + rij^k - djj^ -\- D - djj^ -\- D 



V ^ ^ the parent 

children excluding the scaling functions and coefficients 

+ ^j,k • d^k ^D-d^k + X ^^+1'^' • + ^ • + ^ 
^ ' k' 

intersection of children wavelets v ✓ 



children- specific wavelets 



parent and children intersection 



new cost for children 



(7.15) 



parent center and wavelet translations 
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Once the encoding costs in fTT^ , and (fTTS)) (for all < djj^ < d'. j^) are 

all computed, we pick the method with the smallest cost for encoding the data in 
Cj,/c, and also update ^j^k^^j,k correspondingly. We propose in the next section a 
pruning algorithm for practical realization of the above ideas. 

7.1.2. A pruning algorithm. The algorithm requires as input a data set Xn and a 
precision parameter e > 0, and outputs a forest with ortho normal matrices {^j,k} 
and {^j,/c} attached to the nodes and an associated cost function (fj^k defined on 
every node of the forest quantifying the cost of optimally e-encoding the data in 
that node. 

Our strategy is bottom-up. That is, we start at the leaf nodes and e-encode 
them by using local PGA planes of minimal dimensions, and let {^j,k} and {^j^k} 
be their bases and corresponding encoding costs. We then proceed to their parents 
and determine the optimal way of encoding them using (|7.13p . (j7.14p and (|7.15p . 
If the parent-only encoding achieves the minimal encoding cost, then we remove 
all the offspring of this node from the tree, including the children. If the children- 
only is the best, then we separate out the children subtrees from the tree and form 
new trees (we also remove the parent from the original tree and discard it). Note 
that these new trees are already optimized, thus we will not need to examine them 
again. If the wavelet encoding with some ^Jj^ (and corresponding wavelet bases 
^j+i,/cO does the best, then we update ^j^k '= [^J,k ^j,/c] ^j^k accordingly and 
let ^j+i^fc/ store the complement of <l>^^ in We repeat the above steps for 

higher ancestors until we reach the root of the tree. We summarize these steps in 
Fig. [20] below. 

7.2. Comparison with SVD. In this section we compare our algorithm with Sin- 
gular Value Decomposition (SVD) in terms of encoding cost for various precisions. 
We may think of the SVD, being a global analysis, as providing a sort of Fourier 
geometric analysis of the data, to be contrasted with our GMRA, a multi-scale 
wavelet analysis. We use the two real data sets above, together with a new data 
set, the Science News, which comprises about 1100 text documents, modeled as 
vectors in 1000 dimensions, whose i-th entry is the frequency of the i-th word in 
a dictionary (see [30j for detailed information about this data set). For GMRA, 
we now consider three different versions: (1) the regular GMRA, but with the op- 
timization strategies discussed in Sees. 16.11 and 16.21 (2) the orthogonal GMRA (in 
Sec.EI and (3) the pruning GMRA (in Sec.EI]). For each version of the GMRA, we 
threshold the wavelet coefficients to study the rates of change of the approximation 
errors and encoding costs. We present three different costs: one for encoding the 
wavelet coefficients, one for the dictionary, and one for both (see Fig. [2T]) . 

We compare these curves with those of SVD, which is applied in two ways: first, 
we compute the SVD costs and errors using all possible PGA dimensions; second, 
we gradually threshold the full SVD coefficients and correspondingly compress the 
dictionary (i.e., discard those multiplying identically zero coefficients). The curves 
are superposed in the same plots (see the black curves in Fig. [2Tj). 

8. Computational considerations 

The computational cost may be split as follows. 
Construction of proximity graph: we find the k nearest neighbors of each of the 
n points. Using fast nearest neighbor codes (e.g. cover trees [5X and references 
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PrunGMRA = PruningGMRA (Xn.e) x 
/ / Input: 

/ / Xn : a set of n samples from M 
/ / e: precision 

// Output: 

//A forest T of dyadic cells {Cj^k} with their local means {cj^k} and PCA bases 
{$j,/c}, and a family of geometric wavelets {^j^k}, {wj,k}, as well as encoding costs 
{^j,k}, associated to the nodes 

Construct the dyadic cells Cj,fc, and form a tree T with local centers cj^k- 
For every leaf node in the tree T, compute the minimal dimension d^j j. and corre- 
sponding basis and encoding costs cpj^k for achieving precision e 
for j — J — 1 down to 1 

Find all the nonleaf nodes of the tree T at scale j 
For each of the nodes (j, k),k G JCj, 

(1) Compute the encoding costs of the three methods, i.e., parent-only, 
children-only, and wavelet, using equations ()7.13p . (|7.14p and ()7.15p . 

(2) Update cpj^k with the minimum cost, 
if parent-only is the best, 

delete all the offspring of the node from T, and let ^j^k = 
elseif children-only is the best, 

separate out the children subtrees from T and form new trees, and also 

remove and discard the parent node 

else 

update := [^^^^^fc] and (pj^k accordingly and let $^+1,^/ store the 

complement of ^^^j^ in $^+1^^/. 

end 

end 



Figure 20. Pseudo-code for the construction of the Pruning Geo- 
metric Wavelets 



therein) the cost is Od^oi'^logn)^ with the constant being exponential in the 
intrinsic dimension of A^, and linear in the ambient dimension. The cost of 
computing the weights for the graph is 0{knD). 

Graph partitioning: we use METIS [56 to create a dyadic partition, with cost 
0{kn\ogn). We may (albeit in practice we do not) compress the METIS tree into 
a 2'^-adic tree; however, this will not change the computational complexity below. 
Computation of the ^j^k ^s: At scale j each cell Cj^k of the partition has a number 
of points rij^k = 0(2~-^^n), and there are \JCj\ = 0{2^^) such Cj^s. The cost 
of computing the rank-(i SVD in each Cj^k is 0{nj^kDd)^ by using the algorithms 
of [69]. Summing over j = 0, 1, . . . , J with J ~ log2d n we obtain a total cost 
0{Dnlogn). At this point we have constructed all the ^j,fe's. Observe that instead 
of J ~ log2d n we may stop at the coarsest scale at which a predetermined precision 
e is reached (e.g. J ~ log2 ^ for a smooth manifold). In this case, the cost of this 
part of the algorithm only depends on e and is independent of n. A similar but 
more complex strategy that we do not discuss here could be used also for the first 
two steps. 
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Figure 21. Cost-error curves for different kinds of encoding costs 
(left to right columns: overall, coefficients, dictionary) obtained 
on the three real data sets (top to bottom rows: MNIST digits, 
Yale Faces, and Science News) by the GMRA and SVD algorithms 
(represented by different curves in different colors). We see that 
all GMRA versions outperform SVD and its thresholding version 
in terms of coefficient costs (middle column), but take more space 
to store the dictionary (right column). This makes sense from the 
sparse coding perspective. Overall, the pruning GMRA algorithm 
does the best, while the other two GMRA versions have very close 
performance with both versions of SVD (see left column). 



Computation of the ^j^k^s: For each cell Cj,^, where j < J, the wavelet bases 
^j+i^fc/, k' G children(jf, k) are obtained by computing the partial SVD of a (i x 2^d 
matrix of rank at most which takes 0{D • 2^d • d). Summing this up over all 
j < J, we get a total cost of 0{nDd^). 
Overall, the algorithm costs 

(8.1) 0{nD{\og{n) + d^)) + OdA^\ogn) . 

The cost of performing the FGWT of a point (or its inverse) is the sum of the costs 
of finding the closest leaf node, projecting onto the corresponding geometric scaling 
function plane, and then computing the multi-scale coefficients: 

(8.2) Od{D\ogn)^ dD + ©(d^loge"*) , 

^ V ' V ' 

cost of findine; . cost of cost of multi-scale 

nearest cj,, projecting on ^j,^ transform 
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Figure 22. Timing experiments for the construction of geo- 
metric wavelets. We record separately the time to construct 
the nearest neighbor graph ('Graph'), the multi-scale partitions 
('Tree'), and the geometric wavelets ('Geom. Wav.'). Left: time 
in miliseconds (on the vertical axis, in log;LO scale) vs. n (on 
the horizontal axis, also log^^g scale) for S^(n,I),<j), for n = 
1000,2000,4000,8000,16000,32000, d = S, D = 100, and a = 
0, All the computational times grow linearly in n, with the 
noise increasing the computational time of each sub-computation. 
Center: same as left, but with D = 1000. A comparison with the 
experiment on the left shows that the increased ambient dimen- 
sionality does not cause, in this instance, almost any increase in 
the noiseless case, and in the noisy case the increase is a meager 
factor of 10, which is exactly the cost of handling vectors which 
are 10 times larger in distance computations, with no curse of am- 
bient dimensionality. Right: computation times as a function of 
intrinsic dimension: we vary d = 2, 4, 8, 16, 32 (in log^o scale on the 
horizontal axis)), and notice a mild increase in computation time, 
but with higher variances in the times for the computation of the 
multi-scale partitions. 

with the Od in the first term subsuming an exponential dependence on d. The cost 
of the IGWT is similar, but without the first term. 

We report some results in practical performance in Fig. [22j 

9. A NAIVE ATTEMPT AT MODELING DISTRIBUTIONS 

We present a simple example of how our techniques may be used to model 
measures supported on low-dimensional sets which are well- approximated by the 
multi-scale planes we constructed; results from more extensive investigations will 
be reported in an upcoming publication. 

We sample n training points from a point cloud A4 and, for a fixed scale j, 
we consider the coarse approximation A4j (defined in (|2.10p ). and on each local 
linear approximating plane Vj^k we use the training set to construct a multi-factor 
Gaussian model on Cj^k' let iVj^k be the estimated distribution. We also estimate 
from the training data the probability 7Tj{k) that a given point in Ai belongs to 
Cj^k (recall that j is fixed, so this is a probability distribution over the \JCj \ labels of 
the planes at scale j). We may then generate new data points by drawing a /c G /C j 
according to TTj, and then drawing a point in Vj^k from the distribution iTj^k- this 
defines a probability distribution supported on M.j , that we denote by pmj • 
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Figure 23. We generate a family of multi-scale models {pijf^i, 
from 500, 1000, 2000, 4000 (corresponding to z = 1, . . . , 4) training 
samples from the swiss-roll manifold. Left: the blue points are 1000 
training points, the red points are 4000 points generated according 
to p2 at the finest scale j = 6. Right: for each i = 1, ... ,4 and 
each scale j, we generate from pi at scale j a point cloud of 4000 
samples, and measure its Hausdorff distance (dotted lines) and 
"Hausdorff median distance" (continuous lines) from a randomly 
generated point cloud with 4000 points from the true distribution 
on the Swiss roll. The x-axis is the scale j of the model used, and 
colors map the size of the training set. The construction of these 
models and the generation of the points clouds takes a few seconds 
on a standard desktop. 

In this way we may generate new data points which are consistent with both the 
geometry of the approximating planes Vj^k and with the distribution of the data 
on each such plane. In Fig. [23] we display the result of such modeling on a simple 
manifold. In Fig. [23] we construct p^j by training on 2000 handwritten 7's from 
the MNIST database, and on the same training set we train two other algorithms: 
the first one is based on projecting the data on the first aj principal components, 
where aj is chosen so that the cost of encoding the projection and the projected 
data is the same as the cost of encoding the GMRA up to scale j and the GMRA 
of the data, and then running the same multi-factor Gaussian model used above 
for generating iTj^k- This leads to a probability distribution we denote by psvDj- 
Finally, we compare with the recently-introduced Multi-Factor Analyzer (MFA) 
Bayesian models from [39j. In order to test the quality of these models, we consider 
the following two measures. The first measure is simply the Hausdorff distance 
between 2000 randomly chosen samples according to each model and the training 
set: this is measuring how close the generated samples are to the training set. The 
second measure quantifies if the model captures the variability of the true data, and 
is computed by generating multiple point clouds of 2000 points for a fixed model, 
and looking at the pairwise Hausdorff distances between such point clouds, called 
the within-model Hausdorff distance variability. 

The bias- variance tradeoff in the models pMj is the following: as j increases the 
planes better model the geometry of the data (under our usual assumptions), so 
that the bias of the model (and the approximation error) decreases as j increases; 
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Figure 24. A training set of 2000 digits 7 from the MNIST data 
set are used to train probability models with GMRA {pMji oiie 
for each scale j in the GMRA of the training set), SVD {psvDj^ 
one for each GMRA scale, see text), and MFA pmfa- Left: 32 
digits drawn from pm5^ Psvd^ and pmfa' the quality of pmb and 
Pmfa is qualitatively better than that of psvDb'^ moreover pmb 
seem to capture more variability than Pmfa- Center: plots of the 
Hausdorff distance to training set and in-model Hausdorff distance 
variability. We see that both pMj and Pmfa have similar distance 
to the training set, while psvDj^ being a model in the ambient 
space, generates points farther from the distribution. Looking at 
the plots of the in-model Hausdorff distance variability, we see that 
such measure increases for p^. as a function of j (reflecting the 
increasing expression power of the model), while the same mea- 
sure for Pmfa is very small, implying that MFA fails to capture 
the variability of the distribution, and simply generates an almost 
fixed set of points (in fact, local averages of points in the training 
set), well-scattered along the training set. Timings: construction 
of GMRA and model construction for all scales for GMRA took 
approximately 1 min, for SVD 0.3 min, for MFA about 15 hrs. 
Right: a similar experiment with a training set of 2000 points 
from a swissroll shaped manifold with no noise: the finest scale 
GMRA-based models perform best (in terms of both approxima- 
tion and variability, the SVD-based models are once again unable 
to take advantage of the low-intrinsic dimension, and MFA-based 
models fail as well, to succeed they seem to require tuning the pa- 
rameters far from the defaults, as well as a much larger training 
set. Timings: construction of GMRA and model construction for 
all scales for GMRA took approximately 4 sec, for SVD 0.5 sec, 
for MFA about 4 hrs. 



on the other hand the sampling requirements for correctly estimating the density 
of Cj^k projected on Vj^k increases with j as less and less training points fall in Cj^k- 
A pruning greedy algorithm that selects, in each region of the data, the correct 
scale for obtaining the correct bias-variance tradeoff, depending on the samples 
and the geometry of the data, similar in spirit to the what has been studied in the 
case of multi-scale approximation of functions, will be presented in a forthcoming 
publication. 
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10. Future work 

We consider this work as a first "bare bone" construction, which may be refined 
in a variety of ways and opens the way to many generahzations and apphcations. 
For example: 

• User interface. We are currently developing a user interface for interact- 
ing with the geometric wavelet representation of data sets [70] . 

• Higher order approximations. One can extend the construction pre- 
sented here to piecewise quadratic, or even higher order, approximators, 
in order to achieve better approximation rates when the underlying set is 
smoother than C^. 

• Better encoding strategies for the geometric wavelet tree. The 

techniques discussed in this paper are not expected to be optimal, and 
better tree pruning/tuning constructions may be devised. In particular, to 
optimize the encoding cost of a data set, the geometric wavelet tree should 
be pruned and slightly modified to use a near-minimal number of dictionary 
elements to achieve a given approximation precision e. 

• Sparsifying dictionary. While the approximation only depends on the 
subspaces {^j,k)^ the sparsity of the representation of the data points will 
in general depend on the choice of ^j^k and ^j,/c, and such choice may 
be optimized ("locally" in space and in dimension) by existing algorithms, 
thereby retaining both the approximation guarantees and the advantages 
of running these black-box algorithms only on small number of samples and 
in a low-dimensional subspace. 

• Probabihstic construction. One may cast the whole construction in a 
probabilistic setting, where subspaces are enriched with distributions on 
those subspaces, thereby allowing geometric wavelets to generate rich fam- 
ilies of probabilistic models. 

11. Appendix 

Proof of Theorem \2.3[ . The first equality follows by recursively applying the two- 
scale equation (j2.19p . so we only need to prove the upper bound. We start with 
the case p = +oo. By compactness, for every x ^ M and for jo large enough and 
j > jo, there is a unique point Zj^x ^ closest to Cj^x^ and Cj^x is the graph 
of a C^+" function / := fj^x • Pt^. ^{Cj^x) ^j,x^ where Tzj^^{A4) is the plane 
tangent to A4 at Zj^x- Note that this is true whether we construct dyadic cells Cj^x 
with respect to the manifold metric p, or by intersecting Euclidean dyadic cubes 
with A4. The following calculations are in the spirit of those in [8 . Since all the 
quantities involved are invariant under rotations and translations, up to a change of 
coordinates we may assume that = 0, T^^.^ = (xi, . . . ^Xd)- Assume a = 1, 

i.e. the manifold is C^. In the coordinates above the function / =: (/i, . . . , fo-d) 
above may be written 

(11.1) fi{w) = ^{W - Zj^xfHif\^.^^{w - Zj^x) + - Zj^xW^) , 

where Hi is the d x d Hessian of the i-th coordinate fi of /. The calculations in 
[8] show that, up to higher order terms, Yj^x is parallel to and differs from it 

by a translation along the normal space A^c^^, since Yj^x passes through Cj^x while 
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Tzjx passes through Zj^x- Therefore we have 



\z- PMj{z)\\j^D\ 



sup \\z -¥j^x{z)\\j^D 



zeCi, 



= sup \\Z - Pt,.^^{z - Cj^a,) - Cj^a^W^D 
z G Cj ^ X 

< sup \\{Z - Zj^^) - Pt,,^^{z - Zj^a:)\\R- 

Z^Cjx 



< sup 



-{W - Zj^xTHiflz^^^iw - Zj^x) + - Zj^xW^) 



where k 



\liA\ is a measure of extrinsic curvature, and where 



we used that Cj^x is in the convex huh of Cj^x- A similar calculation applies to the 
case where ji G C-^+", where 0(\\w — replaces the second order terms, 

and K is replaced by max^^ji ... ||V/i||c"- 
We now derive an L'^{Cj^x^ l-(^j,x) estimate: 



II Ik - Pmj{z) 

= ^ I 



-IPj^WIlL d^{z) 



mm 



^j,x) Jc, 



< 



U: an afRne d— plane /jlI^Cj ) Jc^ 
D 

^ A^(cov^-^) 

l=d+l 



\z - Pu{z)r dfi{z) 



< max 



d{d^l) 



eSD-d 4((i + 2)((i + 4) 
+ 0(2-^^-), 



D-d 



1=1 



1 



m-d 



^ wiTt{Hi) 



. 1=1 



2-4j 



where the inequality before the last follows from the fact that, up to order 2~^-^, 
there are no more than d{d + l)/2 curvature directions, and the last inequality 
follows from the bounds in [8 , which formalize the fact that the eigenspace spanned 
by the top d vectors of covj^^c is, up to higher order, parallel to the tangent plane, 
and passing through a point cj^x which is second-order close to A^, and therefore 
provides a second-order approximation to A4 at scale . This latter bounds could 
be strengthened in obvious ways if some decay of Xi{covj^k) for / = (i+ 1, . . . , d{d-\- 
l)/2 was assumed. The estimate in (|2.20p follows by interpolation between the 
estimate in and the one in L"^. □ 

The measure of curvature multiplying 2~^^ in the last bound appeared in [8^: it 
may be as large as 0{{D — d)^?)^ but also quite small depending on the eigenvalues 
of the Hessians Hi. 
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